The primary distinction between evaluation and other empirically
based activities identified as research is that evaluation serves decision
making, or is decision-oriented, while other research is conclusion-oriented
(Cronbach & Suppes, 1969). In "models" of educational evaluation,
decision making is usually viewed as the work of a single decision maker
or decision-making body. When evaluators actually become involved with
educational institutions or agencies, they often discover that there are
multiple stakeholders in the decision situation, rather than just the
specified decision maker. For example, in every school system, the
administrators and school board are ultimately responsible to the community,
and must be responsive to it. Results of an evaluation study must be credible
to the various stakeholder groups in the community, or the results will
not be useful.

Stakeholder groups usually have different ideological, theoretical,
and practical perspectives. A simplistic example is the "humanist" groups
versus the "back-to-basics" groups in many of our communities. Each
has a different view of the primary functions of the schools and each group
assigns different weights to educational outcomes. For an evaluation to
address the most important issues in a specific setting, and remain credible
to the various stakeholder groups is a difficult task.
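
The effect of differing weights can be made concrete with a small sketch.
Everything in it is hypothetical: the two programs, their outcome scores, and
the weights attributed to each group are invented for illustration and are not
taken from any evaluation discussed here. It shows only that the same outcome
data can lead two groups to prefer different programs.

```python
# Hypothetical outcome scores (0-1 scale) for two illustrative programs.
programs = {
    "Program A": {"basic_skills": 0.80, "self_concept": 0.50},
    "Program B": {"basic_skills": 0.60, "self_concept": 0.85},
}

# Hypothetical weights each stakeholder group assigns to the outcomes.
weights = {
    "back_to_basics": {"basic_skills": 0.8, "self_concept": 0.2},
    "humanist": {"basic_skills": 0.3, "self_concept": 0.7},
}


def weighted_value(scores, group_weights):
    """Weighted sum of outcome scores under one group's weights."""
    return sum(group_weights[k] * scores[k] for k in group_weights)


def preferred_program(group):
    """Program with the highest weighted value for the given group."""
    return max(programs, key=lambda p: weighted_value(programs[p], weights[group]))


for group in weights:
    print(group, "prefers", preferred_program(group))
```

With these invented numbers the "back-to-basics" weights favor Program A and
the "humanist" weights favor Program B, although both groups are looking at
identical outcome scores.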



This very real problem has been identified by many evaluators,
and several authors have recommended procedures for systematically
dealing with the discrepant views and values of different stakeholders
(Edwards & Guttentag, 1975; Stake, 1975; Stenner, 1976). Edwards and
Guttentag suggest a "multi-attribute utility analysis" that takes the
differing values into account when setting up the evaluation plan and
analysis procedures. Stake has described an approach that he calls
"Responsive Evaluation." According to Stake,

An educational evaluation is responsive evaluation
(1) if it orients more directly to program activities
than to program intents, (2) if it responds to audience
requirements for information, and (3) if the different
value perspectives of the people at hand are referred
to in reporting the success and failure of the
program. (1975, p. 10)

Stenner uses "Policy Implications Analysis," which asks members of the
various stakeholder groups to identify the types of evaluative information
and appropriate reporting formats that they would like from the evaluation
at the end of the program (or other future date specified).

Coleman (1972) discusses a related problem of the limitations of
any one evaluation study, especially if it is carried out from only one
theoretical or disciplinary perspective. Every decision situation can be
viewed from various perspectives, each of which may lead to very different
decisions. His suggestion that a number of concurrent evaluations take place,
using different theoretical and disciplinary bases, is especially pertinent
for dealing with multiple stakeholders.

In decision situations, various groups are contending for limited
resources. Each group will use whatever information is available to support
its own position, regardless of the quality of the data. If the evaluator
wants to contribute information useful to the decision-making process,
(s)he must attempt to represent the major differing perspectives and
report the information as comprehensively and accurately as possible. My
own experiences as an evaluator, combined with the experiences recorded
in other major evaluation reports, have revealed that this task is an
extremely difficult one.

A major problem in meeting the needs of the various stakeholders
is that methodology is often used without recognizing the assumptions that
are required for meaningful interpretation of the results for the specific
situation. This relates to the role of "methodologist" as discussed by
Lazarsfeld and Rosenberg (1955):

The term methodology . . . implies that concrete
studies are being scrutinized as to the procedures they
use, the underlying assumptions they make, the modes
of explanation they consider as satisfactory. (p. 3)

In order to report data accurately and to make appropriate interpretations
of results, a number of fundamental considerations about the use and
interpretation of empirical information are needed.

Evaluation Data and Interpretation

In this paper several issues related to accurate specification of
data and interpretation of results are discussed. The focus of these
comments is to make an evaluation report more readily interpretable by readers
who are not experts in research methodology or evaluation, which is often
the case with many educators and "lay" groups. Several types of information
are recommended for inclusion in reports, and examples are given
where possible. The issues addressed deal with (a) the subjective basis
of all data and its interpretation, (b) rationales for including variables
and measures in an educational evaluation, (c) assumptions required for
meaningful interpretation of data in a specific setting, and (d) inadequacies
of the experimental paradigm for evaluations.

Subjective Basis of Empirical Information¹

In general, laws, theories, variables, and measures in the behav- 
ioral sciences are man-made conceptualizations. For example, what is 
"reading" from one perspective is "symbol processing" from another. 
The measures that would be used for assessing each would be very differ- 
ent, and different researchers could use quite different measures of the 
same variable. 

Another example of the subjective basis of the interpretation of data
is the varied uses of the Coloured Progressive Matrices Test. Raven (1962)
developed it as a measure of nonverbal IQ, which was believed at that time
to be a genetic characteristic. The test scores were used in the evaluation
of the National Follow Through Program (FT) as a nonverbal problem-solving
measure, skills assumed to be learned and affected differentially
by the various instructional models.

In the design stage of an evaluation, the evaluator must decide which
variables and measures to include. These decisions are based on the
evaluator's perspective of the program, its context, and its purpose. This view
is often discipline-based. For example, the evaluation of Follow Through,
like most educational evaluations, utilized only academic achievement,
self-concept, and individual responsibility measures. This represents
primarily a psychology-based view of education. In order to make sense
out of an evaluation such as that of Follow Through, the reader must be
aware of the evaluator's perspective of the program and the evaluation,
and the evaluator's rationales for including whatever information is
presented.

In reviewing modern developments in the philosophy of science,
Campbell (1974) indicated that:²

Non-laboratory social science is precariously scientific
at best. But even for the strongest sciences, the
theories believed to be true are radically underjustified
and have, at most, the status of "better than" rather than
the status of "proven." All common-sense and scientific
knowledge is presumptive. In any setting in which we
seem to gain new knowledge, we do so at the expense of
many presumptions, untestable--to say nothing of
unconfirmable--in that situation. While the appropriateness of
some presumptions can be probed singly or in small sets,
this can only be done by assuming the correctness of the
great bulk of other presumptions. Single presumptions or
small subsets can in turn be probed, but the total set of
presumptions is not of demonstrable validity, is radically
underjustified. (p. 2)

Conclusion-oriented researchers have the freedom, if not the
responsibility, to carry out their studies within a well-defined theoretical
perspective in order to test the theory and contribute to knowledge within that
perspective, regardless of the extent to which it is justified by empirical
evidence. Evaluators in real-life situations have the responsibility of
providing information useful to that situation. In order to do this, the
presumptions upon which the data and their interpretations are based must be
specified and the extent to which they are met in a particular situation must
be estimated. The presumptions, or assumptions, and the extent to which they
are met in an evaluation, are discussed in the next two sections of this paper.

Rationales for Variables and Measures in an Evaluation

Whenever an evaluation is planned, a wide range of variables and
measures are initially identified for possible inclusion. Some variables
and measures are inevitably excluded during the selection process. The
evaluation contractor is usually most knowledgeable about the compromises
and deletions that are made at this time. Unfortunately, a discussion
of the selection process is seldom, if ever, included in an evaluation
report. Thus, the best thinking about this problem and the rationales for
the decisions are lost to the field and to society. They are also not
available to the stakeholders, who need that information so that they can more
appropriately assess the relative value of the evaluation's conclusions as
they relate to decision alternatives. Without such a discussion, only the
most knowledgeable reader will be able to recognize the limited nature of
the evaluation, and weight the possible alternatives appropriately.

A good example of such a discussion appeared in Design for the
Individualized Instruction Study (Cooley & Leinhardt, 1975). The rationale
for excluding noncognitive variables in the evaluation design was included
as an appendix to that study. It indicated the steps that were followed and
the criteria that were used to arrive at the recommendations. Cooley and
Leinhardt also presented their rationale for using a standardized achievement
test to assess cognitive outcomes. The criteria utilized to compare
possible tests were delineated. The actual test reviews were included as
another appendix, in which the subtests of each achievement battery, the
psychometric characteristics, the available norms, and other characteristics
were described.⁴

There are many pressures on evaluation contractors to make an
evaluation appear as comprehensive and competent as possible. Thus,
the omission or weaknesses of the chosen set of variables and associated
measures are seldom presented and discussed. When they are not presented,
the reader may be left with the impression that all important variables
were included in the evaluation, and the procedures used did measure
them adequately (if not comprehensively). As a result, the particular
groups that these measures favor will use the results to fight for a
decision that supports their position and gives them more of the resources.

A good example of this type of use by a stakeholder involved the
use of Follow Through (FT) evaluation results (Stebbins, 1976) by the
Oregon FT Program and SRA (the publisher of DISTAR, a central component
of the Oregon model). These results were immediately put into a
short paper indicating that Oregon was the one successful FT model, yet
no plans were being made to provide additional funds for dissemination of
this program. SRA disseminated these results broadly.

A closer reading of the FT evaluation results would indicate the
limited sense in which the Oregon model was the "most successful." The
lack of clear articulation of the sense in which the model was "best" gave
Oregon and SRA the license to use the evaluation results and language to
their best political advantage.⁵

In interpreting evaluative data, stakeholders may use it in ways
that are inappropriate in the view of the evaluator (although I am not
saying that was the case with the Oregon and SRA uses and interpretations).
However, in any situation where the rationales and caveats do not appear
appropriately in the report, the evaluator must take some responsibility
for any misuse.

Assumptions Required for Meaningful Interpretations of Evaluative Data

As indicated previously, all quantitative data are based on
presumptions about the data. Some of these are often the assumptions of the
particular statistical technique used to analyze the data. For example, the
usual parametric assumptions about data for analysis of variance (ANOVA)
include:

1. Independent observations

2. Populations are normally distributed

3. Populations have equal variances

4. Variables are measured on interval or ratio scales.

If these assumptions are adequately met, then ANOVA results can be
meaningfully interpreted. Much is known about the effects on ANOVA when data
do not precisely meet the assumptions, and that knowledge must be
considered when deciding about the adequacy of the data in a specific situation.
When more sophisticated techniques, such as multiple regression or ANCOVA,
are used, the assumptions are more numerous and the effects of failing to
meet them precisely are usually not accurately specified. (See DiCostanzo
& Eichelberger, 1977, for a discussion of information needed to assess
assumptions required by ANCOVA.)
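
As a minimal sketch of what reporting on these assumptions might involve, the
following computes a one-way ANOVA F statistic alongside a crude descriptive
check on the equal-variance assumption. The scores are invented, the function
names are illustrative, and the variance ratio is only a rough screening
device assumed here for illustration; it is not a formal test and is not a
procedure taken from any of the evaluations discussed.

```python
from statistics import mean, variance


def one_way_f(groups):
    """F statistic for a one-way ANOVA over a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean.
    ss_within = sum(sum((x - mean(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


def variance_ratio(groups):
    """Largest sample variance divided by the smallest (1.0 means equal)."""
    variances = [variance(g) for g in groups]
    return max(variances) / min(variances)


# Hypothetical scores for three classrooms (illustrative only).
groups = [[10, 12, 14], [20, 22, 24], [30, 32, 34]]

print("F =", one_way_f(groups))                    # 75.0 for these data
print("variance ratio =", variance_ratio(groups))  # 1.0: equal sample variances
```

A report that gave the F statistic without some indication of how far the data
departed from the assumptions (here, a variance ratio well above 1.0 would be
a warning sign) would leave the reader unable to judge how much confidence
the result deserves.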

In most settings, numerous other assumptions must also be met if
interpretations that are meaningful for decision making are to be made. In
general, these involve threats to internal and external validity, as described
by Campbell and Stanley (1963) and Bracht and Glass (1968). These threats
are seldom controlled adequately in any natural setting--especially one
as complex as the educational setting, where buildings, teachers,
administrators, and the social contexts of the schools vary so greatly. The
particular strengths and weaknesses of the analytic techniques used and
the confidence that one can have in the results and interpretations must
be specified.

In addition to these logical concerns, the relationship between the
analyses being carried out and the evaluation question being addressed is
based on assumptions about the data and the education setting that are often
tenuous. For example, in the FT evaluation the two major concerns to be
addressed originally were:

1. Assessing program impact on pupils, parents,
schools, and community (Emrick, Sorensen, &
Stearns, 1973, p. 72).

2. Assessing relative effectiveness of different
programs and program approaches (Sorensen
& Madow, 1969, p. 4).

The evaluation design on which the FT final report was based essentially
involved measuring pupil outcomes (academic achievement, self-concept,
and individual responsibility for learning) at the end of third grade for
pupils in the FT classrooms, and comparing the results to the outcomes
for "similar"⁶ students not in FT classrooms. The differences in outcomes,
after numerous covariates were used to adjust results statistically,
were identified as "program effects." In order to interpret the results as
program effects, a number of assumptions had to be met. The simplest two,
for illustrative purposes, were that: (a) the groups were initially similar,
and (b) whatever differences obtained were due to different educational
experiences of students.

If the first assumption of the evaluation design, that the two
groups were similar, was met, then the general question of the impact
of the total national FT program was addressed to some extent. (Keep
in mind that the FT program included psychological, medical, dental,
and nutritional support, as well as the use of classroom aides, etc.,
and not merely innovative educational programs.)

If the second assumption, that the children experienced different
educational programs, was also met, then the second evaluation question
was also addressed to some extent. The evaluators did not attempt to
identify the differences in the educational experiences of the two groups
(FT and non-FT) that were tested. All that is known about these students
is that one group participated in classrooms identified as "Follow Through"
and the other group did not. Other factors also existed that question the
adequacy with which that second question was addressed. For example,
the program effect was measured by the adjusted differences between
a single FT site and its non-FT comparison group; thus, each value
was on a different metric, and the relative effectiveness could not be
addressed directly.

In the FT evaluation report (Stebbins, 1976) these assumptions were
not specified, although some information was provided that described the
similarity of the groups compared. The tenuousness of the inferences
from the data to the interpretation of results, as they related to the
evaluation questions, was not presented in the report, however. This left the
impression that the evaluation questions were indeed adequately (if not
comprehensively) addressed.

Inadequacies of the Experimental Paradigm

American society, and especially the academic community, have
been oversold on the applicability of the experimental method to address
almost any type of question in any type of setting. It is such a pervasive
belief that if there is no control group nor tests of statistical significance
in an evaluation study, the study is immediately suspect. This view is
reflected in a quote from the evaluation of FT:

It is an axiom of evaluation that in order to attribute
observed outcomes conclusively to a program, children
who participate in the program must be compared to similar
children who do not. (Stebbins, 1976, p. A-45)

Numerous authors have discussed the inadequacies of the experimental
paradigm for educational research and evaluation in natural settings
(e.g., Guba, 1965, 1977; Edwards & Guttentag, 1975). A major problem
with the experimental paradigm is its assumption that a program is static
rather than dynamic (i.e., the situation is such that an identifiable
independent variable is operating). Guba (1965) questions the value of this
assumption for educational programs (because programs must adapt to the
educational requirements of different kinds of students); he also questions
the likelihood that the assumption is usually met.

Edwards and Guttentag (1975) point out four kinds of dynamic changes
that occur in educational programs:

1. The values of those served by the program and
those who operate the program change.

2. The program evolves--changes shape and character.

3. The external circumstances to which the program
is a response change.

4. Knowledge of program events and consequences
change (p. 415).

Each of these four types of changes occurred within the FT
program--as they will in any longitudinal program. Every FT sponsor's
program changed over the years. For example, FT sites associated
with the Learning Research and Development Center (LRDC) adapted the
Center's instructional materials to meet their particular needs. In
addition, some major changes were made in the content of the kindergarten
and first grade curricula across all sites. These changes were partly
based on knowledge of events and consequences at the sites, and were
partly normal evolutionary changes. Also, in 1967-68, American society
viewed the Head Start Program as a positive first step in compensatory
education, which Follow Through was to continue. The original evaluation
issue in FT was to develop and identify the "best" or the "successful"
models. Later, in 1975-77, the value of all compensatory education was
being questioned, and the desired outcomes of primary education tended to
expand beyond the reading and math skills emphasized by FT and measured
by the standardized tests used in the evaluation.

In addition to these dynamic problems, it is frequently impossible
to obtain truly comparable groups of children in stable circumstances
that allow only the program or other treatment variable to operate.
When Richard Anderson, Director of the FT evaluation for Abt Associates,
was asked (at the 1977 AERA convention) whether he would use a control
group if he were to do anything like the FT evaluation again, he indicated
that he would not, because of the many problems that were experienced in
obtaining comparable groups.

The appropriateness and adequacy of evaluative information provided
for decision making must be assessed in each situation. The
tenuousness of interpretations from complex experimental designs with
sophisticated multivariate analyses must be recognized and reported
accurately. In the words of John Tukey (1954), "Experimental statisticians
should be honest and expository about the relation of precise assumptions
and exactly optimum solutions to real situations" (p. 719). The same
types of assessments of results from "responsive" and other types of
evaluations are also needed. This is work for the methodologist as
identified by Lazarsfeld and Rosenberg (1955).

Summary 

The thrust of this paper has been to point out that evaluations occur
within a political decision-making milieu, where multiple stakeholders are
contending for limited funds. Given the subjective basis of empirical
information, different conclusions or recommendations about a program may
result from different ideological, theoretical, and disciplinary perspectives.
The logic behind the interpretation of results, and the assumptions necessary
for such interpretations, must be specified and explained to facilitate the
most appropriate use of an evaluation.

Each of the issues raised in this paper needs further study and
explication if evaluators are to learn how to provide the most useful
information for decision making. Because of the complexity of many statistical
techniques presently used, much work is needed to identify what assumptions
must be met for meaningful and useful interpretations of results in a specific
decision-making situation. The persons best prepared to do this fundamental
work are probably research methodologists who are not practicing evaluators.
Perhaps we can coax our colleagues in research methodology to join us
in doing such needed work.

Footnotes

¹Scriven (1972) discusses qualitative and quantitative sense in
"Objectivity and Subjectivity in Educational Research."

²Campbell (1974) argues that the qualitative basis of quantitative
data must be recognized and that both types of data are needed as
cross-validating sources.

³Much of this discussion is taken from a paper written with
James L. DiCostanzo (DiCostanzo & Eichelberger, 1977).

⁴One oversight, in my view, was the lack of some discussion
of the inadequacies of the test battery that was selected by Cooley and
Leinhardt for use in the evaluation.

⁵Coleman (1972) differentiates between the world of action and
the world of the disciplines. It is my view that in the world of action,
persons bright enough to recognize such an opportunity (as in the FT
evaluation) would consider it foolish not to seize the opportunity. The
Oregon and SRA usage is an example of stakeholders using whatever
information is available to support their position.

⁶The degree of similarity has been a continuing problem for the
FT evaluators. Such problems are identified for one program sponsor
by Eichelberger (1977).